Quickstart#

Here, we’ll go through a very basic example of reconstructing, preprocessing, and visualizing 4D faces from video data using Medusa’s Python API. For more information about its command-line interface, check the CLI documentation!

We’ll use a short video to reconstruct, shown below:

import os  # need 'egl' for 'headless' rendering!
os.environ['PYOPENGL_PLATFORM'] = 'egl'
from IPython.display import Video

from medusa.data import get_example_video
vid = get_example_video()

# Show in notebook
Video(vid, embed=True)  

Reconstruction#

For this example, we’ll use the Mediapipe Face Mesh model to reconstruct the face in the video in 4D, that is, a 3D reconstruction for each frame of the video. We are going to use the high-level videorecon function from Medusa, which reconstructs the video frame by frame and returns a MediapipeData object, which contains all reconstruction (meta)data.

from medusa.recon import videorecon
data = videorecon(vid, recon_model='mediapipe', loglevel='WARNING')
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

Great! Now let’s inspect the data variable. The reconstructed vertices are stored in the attribute v, a numpy array of shape \(T\) (time points) \(\times\) \(V\) (vertices) \(\times\) \(3\) (X, Y, Z).

print("`v` is of type: ", type(data.v))
print("`v` has shape: ", data.v.shape)
`v` is of type:  <class 'numpy.ndarray'>
`v` has shape:  (232, 468, 3)

The data contained in v represents, for each time point, the 3D coordinates of the vertices (also called “landmarks”) that describe the shape of the face. The particular mesh used by Mediapipe contains 468 vertices, but other reconstruction models may contain many more vertices (like FLAME, which contains 5023 vertices)!
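With an array of this shape, standard numpy indexing gives you either a single vertex’s trajectory over time or all vertices at one time point. The sketch below uses random data in place of data.v, just to illustrate the indexing:

```python
import numpy as np

# Stand-in for `data.v` (random values, only to show the indexing;
# the real array comes from the reconstruction above)
T, V = 232, 468
v = np.random.default_rng(0).normal(size=(T, V, 3))

# Trajectory of a single vertex (here: vertex 0) across all time points
vertex_traj = v[:, 0, :]   # shape: (232, 3)

# X coordinate of every vertex at the first time point
x_t0 = v[0, :, 0]          # shape: (468,)

print(vertex_traj.shape, x_t0.shape)  # (232, 3) (468,)
```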

To get an idea of the data, let’s just extract the 3D vertices from the first time point (i.e., the first frame of the video) and plot them. We need to do this in 2D, of course, so we’ll just use a scatterplot to visualize the X and Y coordinates only:

import matplotlib.pyplot as plt
t0 = data.v[0, :, :]  # first time point
t0_x = t0[:, 0]
t0_y = t0[:, 1]

plt.figure(figsize=(6, 6))
plt.scatter(t0_x, t0_y)
plt.axis('off')
plt.show()
[Figure: scatterplot of the X and Y coordinates of the 468 vertices at the first time point]

A more appealing way to visualize the reconstruction is as a “wireframe”. Medusa can do this for all time points, creating a video of the full 4D reconstruction, and (optionally) render it on top of the original video. To do so, you can use the render_video method that every data object in Medusa has.

We do this below. By setting the video parameter to the path of the video, we tell the render_video method to render the wireframe on top of the original video:

f_out = './example_vid_recon.mp4'
data.render_video(f_out, wireframe=True, video=vid)

# Show in notebook
Video('./example_vid_recon.mp4', embed=True)

That looks pretty good! However, there are two issues with the data as it is now. First, each vertex represents both “global” (rigid) movement (i.e., the face moving left/right/up/down and rotating) and “local” (non-rigid) movement (i.e., facial expressions such as smiling and frowning). Second, some of this rigid movement seems to reflect noisy “jitter”: small inaccuracies in the reconstruction.

Alignment#

We can separate global and local movement by aligning the reconstructions across time. Alignment, here, refers to the rotation and translation necessary to match the reconstructed vertices from each timepoint to a reference timepoint or template. To align a reconstructed 4D dataset, you can use the align function:

from medusa.preproc import align
data = align(data)
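To build intuition for what alignment does, rigid alignment of two point sets is classically computed with the Kabsch algorithm. The sketch below is an illustration of that idea, not Medusa’s actual implementation: it recovers a known rotation and translation between two toy vertex sets:

```python
import numpy as np

def rigid_align(source, target):
    """Rotate and translate `source` to best match `target` (Kabsch algorithm).
    A sketch for illustration; Medusa's own alignment may differ in detail."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    H = (source - mu_s).T @ (target - mu_t)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return source @ R.T + t

# Toy example: rotate/translate a point set, then recover the alignment
rng = np.random.default_rng(42)
target = rng.normal(size=(468, 3))
theta = np.deg2rad(10)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0,              0,             1]])
source = target @ Rz.T + np.array([1.0, -2.0, 0.5])

aligned = rigid_align(source, target)
print(np.allclose(aligned, target, atol=1e-6))  # True
```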

After alignment, the vertices now represent local movement only (as the global movement has been projected out). Let’s visualize the data again, to confirm that it only represents local movement:

f_out = './example_vid_recon.mp4'
data.render_video(f_out, wireframe=True, video=vid)

Video('./example_vid_recon.mp4', embed=True)

As you can see, the rotation (e.g., head tilt) and translation (moving sideways) have been projected out of the data! Importantly, after alignment, the alignment parameters are stored as a series of \(4 \times 4\) affine matrices (one for each timepoint) in the attribute mat:

# T (timepoints) x 4 x 4
print(data.mat.shape)
(232, 4, 4)
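To see what such a matrix encodes, here is a minimal sketch (with a made-up matrix, not one from data.mat) of how a \(4 \times 4\) affine applies rotation and translation to vertices in homogeneous coordinates:

```python
import numpy as np

# A hypothetical affine: rotate 5 degrees around Z, translate by (2, -1, 0)
theta = np.deg2rad(5)
mat = np.eye(4)
mat[:3, :3] = [[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0,              0,             1]]
mat[:3, 3] = [2.0, -1.0, 0.0]

# Applying it to vertices: append a homogeneous coordinate of 1
v = np.random.default_rng(1).normal(size=(468, 3))
v_hom = np.column_stack([v, np.ones(len(v))])   # (468, 4)
v_transformed = (v_hom @ mat.T)[:, :3]          # back to (468, 3)

# Translation can be read off the last column; rotation from the 3x3 block
print(mat[:3, 3])                                    # [ 2. -1.  0.]
print(np.rad2deg(np.arctan2(mat[1, 0], mat[0, 0])))  # ~5.0 (rotation around Z)
```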

We can convert this matrix representation to a set of translation and rotation parameters (and shear and scale parameters, which we ignore for now) that are easier to interpret. To do this, you can use the decompose_mats method:

# Decompose affine matrices to movement parameters
motion_params = data.decompose_mats()

# Select translation and rotation only (ignore shear/scale)
motion_params = motion_params.iloc[:, :6]

# Show first five timepoints
motion_params.head()
   Trans. X  Trans. Y   Trans. Z  Rot. X (deg)  Rot. Y (deg)  Rot. Z (deg)
0  3.669785 -0.047592 -31.647424     -6.695855      1.881420     -0.956345
1  3.706018 -0.145137 -31.505644     -7.305673      1.583575     -1.297212
2  3.799119 -0.125926 -31.571524     -7.263515      1.272087     -1.211455
3  3.825383 -0.170431 -31.601822     -7.456793      1.712648     -1.059006
4  3.877166 -0.173477 -31.591321     -7.452857      1.515342     -1.165612

Just like the vertices, these parameters can be interpreted as timeseries representing the rigid movement of the face over time:

# Show movement relative to first timepoint
motion_params = motion_params - motion_params.iloc[0, :]
trans_params = motion_params.iloc[:, :3]
rot_params = motion_params.iloc[:, 3:]

fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(12, 6))
axes[0].plot(trans_params)
axes[0].set_ylabel('Translation (in mm.)', fontsize=15)
axes[0].set_xlim(0, motion_params.shape[0])
axes[1].plot(rot_params)
axes[1].set_ylabel('Rotation (in deg.)', fontsize=15)
axes[1].set_xlabel('Frame nr.', fontsize=15)
axes[1].legend(['X', 'Y', 'Z'], frameon=False, ncol=3, fontsize=15)
fig.show()
[Figure: translation (top panel) and rotation (bottom panel) parameters per frame, relative to the first timepoint]

Temporal preprocessing#

Medusa contains several functions to further preprocess the 4D data. One functionality to highlight is temporal filtering, which you can use to filter out low- and high-frequency noise, such as the “jitter” we observed earlier. The bw_filter (“Butterworth filter”) function implements a band-pass filter to do just this:

from medusa.preproc import bw_filter

# cut-off frequencies in Hertz (Hz)
data = bw_filter(data, low_pass=4, high_pass=0.005)
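Conceptually, a band-pass Butterworth filter like this can be sketched with scipy.signal. The snippet below is an illustration, not Medusa’s implementation, and the 30 Hz sampling rate (a typical video frame rate) is an assumption; it removes fast “jitter” from a toy signal while keeping a slower, expression-like component:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 30.0  # sampling rate in Hz (assumed video frame rate)

# Band-pass Butterworth in second-order sections, same cut-offs as above
sos = butter(N=4, Wn=[0.005, 4.0], btype='bandpass', fs=fs, output='sos')

# Toy signal: a slow 1 Hz "expression" plus fast 12 Hz "jitter"
t = np.arange(0, 8, 1 / fs)
signal = np.sin(2 * np.pi * 1.0 * t) + 0.2 * np.sin(2 * np.pi * 12.0 * t)

# Zero-phase filtering (forward + backward), so peaks are not shifted in time
filtered = sosfiltfilt(sos, signal)
```

After filtering, the 12 Hz jitter component is strongly attenuated while the 1 Hz component passes through nearly unchanged.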

Let’s render the data again, which should now look a lot “smoother”:

f_out = './example_vid_recon.mp4'
data.render_video(f_out, wireframe=True, video=None)

Video('./example_vid_recon.mp4', embed=True)

There is a lot more functionality in Medusa, including different reconstruction models, additional preprocessing functions, and analysis options. A great way to explore this is to check out the tutorials!